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Abstract 

Background: Cell-specific gene expression is controlled by epigenetic modifications and transcription factor 
binding. While genome-wide maps for these protein-DNA interactions have become widely available, quantitative 
comparison of the resulting ChlP-Seq data sets remains challenging. Current approaches to detect differentially 
bound or modified regions are mainly borrowed from RNA-Seq data analysis, thus focusing on total counts of 
fragments mapped to a region, ignoring any information encoded in the shape of the peaks. 

Results: Here, we present MMDiff, a robust, broadly applicable method for detecting differences between sequence 
count data sets. Based on quantifying shape changes in signal profiles, it overcomes challenges imposed by the 
highly structured nature of the data and the paucity of replicates. 

We first use a simulated data set to compare the performance of MMDiff with results obtained by four alternative 
methods. We demonstrate that MMDiff excels when peak profiles change between samples. We next use MMDiff to 
re-analyse a recent data set of the histone modification H3K4me3 elucidating the establishment of this prominent 
epigenomic marker. Our empirical analysis shows that the method yields reproducible results across experiments, and 
is able to detect functional important changes in histone modifications. To further explore the broader applicability of 
MMDiff, we apply it to two ENCODE data sets: one investigating the histone modification H3K27ac and one measuring 
the genome-wide binding of the transcription factor CTCF. In both cases, MMDiff proves to be complementary to 
count-based methods. In addition, we can show that MMDiff is capable of directly detecting changes of homotypic 
binding events at neighbouring binding sites. MMDiff is readily available as a Bioconductor package. 

Conclusions: Our results demonstrate that higher order features of ChlP-Seq peaks carry relevant and often 
complementary information to total counts, and hence are important in assessing differential histone modifications 
and transcription factor binding. We have developed a new computational method, MMDiff, that is capable of 
exploring these features and therefore closes an existing gap in the analysis of ChlP-Seq data sets. 

Keywords: Chip-Seq, Differential peak detection, Kernel methods, Maximum mean discrepancy, 
Histone modifications, H3K4me3, Cfpl 



Background 

Chromatin immunoprecipitation followed by deep se- 
quencing (ChlP-Seq) is rapidly becoming the main 
experimental technique in functional genomic and epige- 
nomic studies. ChlP-Seqs ability to profile genome-wide 
patterns of transcription factor binding and histone mod- 
ifications has led to its extensive use by the ENCODE 
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consortium [1] in an endeavour to identify and char- 
acterise all functional elements encoded in the human 
genome. 

Despite the widespread use of ChlP-Seq, data analysis 
is still a challenging task [2] and a typical computational 
pipeline includes a number of steps, each posing its own 
difficulties. An initial crucial step is the identification of 
regions with significant signal enrichment relative to a 
control sample in a process called peak calling. Over the 
last years, several tools for this task have been suggested 
and they have recently been compared in [3]. As a result of 
peak calling, genome-wide catalogues are obtained, which 
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provide valuable snapshots of protein binding or histone 
modifications in a given cell or tissue. 

However, to understand the dynamics of histone modi- 
fications and TF binding and their effects on cell-specific 
gene regulation it is necessary to quantitatively compare 
different ChlP-Seq samples. This is a surprisingly difficult 
task as the statistical assessment of differences is hindered 
by a number of factors: on the one hand, the data is digital, 
consisting of counts of DNA fragments (reads) mapped 
onto regions of the genome. This feature, common to all 
sequencing-based methods, raises the immediate issue of 
choosing a suitable noise model for both technical and 
biological noise. On the other hand, in most studies, only 
a very small number of replicate experiments are per- 
formed, making statistical testing an intrinsically difficult 
task. To compound both of these problems, ChlP-Seq pro- 
duces spatially distributed patterns of binding or histone 
modifications localised to specific regions of the genome 
(peaks); this feature, in particular, renders standard dif- 
ferential testing methods unsuited for the comparison of 
ChlP-Seq data sets. 

Currently, two strategies are predominantly followed 
for the differential analysis of ChlP-Seq data sets: The 
most naive approach is to identify overlaps in the sets 
of genomic peak intervals detected in the different sam- 
ples, e.g., [4-6]. This simplifies the problem to a basic 
occupancy analysis which is insensitive to changes in the 
affinity of TF binding or in the prevalence of histone 
modifications. In addition, the results are strongly depen- 
dent on the thresholds which are set heuristically in the 
peak calling step and differences in the noise background 
may further confound the outcome of this analysis. An 
alternative strategy is to compute the total number of 
reads mapping to each peak in each data set and to 
test for significant fold-changes across multiple tissues or 
conditions, e.g., [7]. These count-based approaches have 
mostly advocated the adaptation of methods for RNA- 
Seq data analysis to the more structured ChlP-Seq data. 
For example, the frequently used methods DBChIP [8] 
and DiffBind [7] are based on the RNA-Seq methods 
DESeq [9] and EdgeR [10]. They employ a negative bino- 
mial distribution to model both biological and technical 
noise in the total counts of expressed genes. To circum- 
vent the problems of low experimental replication, they 
apply an elegant approach in which information is shared 
across genes, effectively pooling together genes with simi- 
lar total counts. An immediate problem arising for count- 
based methods is finding the right normalisation. Initially, 
data sets were rescaled according to the observed library 
size, which corresponds to the total number of reads in 
the whole data set [11-13]. However, it has been shown 
that this strategy is inadequate in most situations, and 
a number of alternatives have been suggested, including 
rescaling to the median of the ratios of observed counts 



[9,14], locally weighted regression (LOWESS) [15] and 
more recently rescaling using common peaks across data 
sets (MANorm, [16]). All these methods make strong a 
priori assumptions about the relationship of the data sets 
that are to be compared. The choice of the normalisa- 
tion method can therefore greatly influences the results of 
count-based differential analysis [14,17,18]. 

Perhaps a more severe limitation of count-based meth- 
ods is the information loss inherent in representing a peak 
by a single integer (the total counts of reads mapping 
into the given peak region). Any higher order informa- 
tion that is conveyed in the peaks is ignored. However, 
a spatial structure of the ChlP-Seq signal is particularly 
evident in the case of peaks associated with epigenomic 
marks. For example, trimethylation of lysine 4 on histone 
H3 (H3K4me3) is known to form distinct bimodal peaks 
at transcription start sites (TSS), e.g. [19]. Interestingly, at 
a given genomic location the shape of observed enrich- 
ment peaks tend to be highly reproducible across bio- 
logical replicates and increasing evidence hints towards a 
functional role of these profile structures [1,20]. Focus- 
ing exclusively on total counts of reads associated with 
a peak might therefore be insufficient when investigating 
differences of epigenomic modifications between differ- 
ent samples and higher order features associated with the 
shape of an enrichment peak should also be taken into 
account. 

In this paper, we introduce MMDiff, a multivariate non- 
parametric approach to testing significant differences in 
profile patterns between peaks in different conditions. 
In contrast to count-based methods, which make their 
decision by comparing a single number, i.e. counts, MMD- 
iff exploits higher order features in the peak shapes. By 
focusing on shape differences, MMDiff accounts explicitly 
for the spatial structure of ChlP-Seq peaks; this also makes 
it more robust to normalisation effects and independent 
of the explicit definition of a noise model. The underly- 
ing idea is to treat each peak as a distribution over a finite 
space given by the starting positions of all reads. The prob- 
lem of testing for differential binding is then reduced to 
testing whether two samples are generated by the same 
probability distribution (albeit unknown). In this context 
a sample consists of all the reads mapping to a given peak 
region in one data set. As there is a large variability of 
observed peak profiles at different genomic locations - 
some may weakly resemble a Gaussian distribution, how- 
ever most are strongly skewed and/or multi-modal (see 
Figure 1) - we cannot make any assumption about the 
type of distribution. We therefore adopt recent advances 
in machine learning research [21,22], which enable us 
to include features of any order in the prediction of 
differential binding without making assumptions of the 
underlying distributions. MMDiff is specifically designed 
to detect differences between different ChlP-Seq data sets, 
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Figure 1 H3K4me3 profiles at tiiree different transcription start sites. Profiles in A and B sliow a typical bimoodal structure, while the peak 
displayed in C is more complex. Data from three different samples {WT, Resc, Cfpl -/-) and measured in two repeat experiments (AB.l : upper panel 
and AB.2: lower panel) are shown. Arrows indicate transcription starts sites and direction of transcription. Shown are normalised read counts. Note, 
that in contrast to coverage plots, reads are here only represented by their estimated mid points. The patterns for WT and Resc strongly resemble 
each other and while the signal in experiment AB.2 is noisier than in AB.l , the overall shapes are very similar. In the Cfpl -/- sample read coverage 
appears to be reduced in parts of the regions. However, the second example shows that a decrease in one part of the region can be compensated 
for by a gain of signal in an upstream region. All three examples were consistently called by MMDiff in both experiments, but not called by any other 
method. 



however, its main idea can also be used to address the 
more general problem of detecting differences in other 
sequencing based experiments, for example in DNase- 
Seq or CAGE-Seq data sets. Recently, a similar approach 
has been employed for the detection of differential RNA 
isoforms from RNA-Seq data [23]. 

We illustrate and compare our method on a simulation 
study, and on three independent ChlP-seq data sets of 
both transcription factor binding and epigenomic mod- 
ifications. Our results show that MMDiff can capture 
biologically meaningful changes and is highly comple- 
mentary to count-based approaches. We propose that 
MMDiff provides an important new tool for bioinfor- 
maticians and biologists interested in epigenomic data 
analysis, conveniently available as a Bioconductor tool. 

The rest of the paper is organised as follows: we start 
with a discription of the statistical foundations of our 
method and a discussion on how the MMD statistic of 
[21] is modified to account for biological variability. We 
complete the Methods section with a thorough simulation 
study which compares the results of our method to four 
different competitors in a controlled environment. This 
enables us to discuss the strengths and weaknesses of the 



various methods, and in particular highlights the com- 
plementarity of the MMDiff approach w.r.t. count-based 
methods. We then present results on three different data 
sets: we start with an in-depth analysis on the H3K4me3 
data set of [24]. As this study constitutes our main bio- 
logical motivation, we present multiple complementary 
analyses that demonstrate the functional significance of 
our results. To establish the broad applicability of our 
method, we also present results on two ENCODE data 
sets: a comparison of the histone mark H3K27ac across 
different human cell lines (K562 and GM12878), and a 
comparison of binding patterns of the transcription factor 
CTCF across different mouse tissues (cortex, cerebellum 
and liver). We conclude the paper with a broader discus- 
sion of the method in the context of NGS data analysis. 

Methods 

Kernel-based statistical tests 

In order to incorporate shape features in a statistical test- 
ing procedure, we adopt a kernel-based non-parametric 
test, which allows us to retain information of any order 
within the testing procedure [21,22]. We briefly review 
here the mathematical foundations of this procedure. 
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The statistical testing question we wish to address is the 
following: Suppose for a peak / we are given m = n^^ obser- 
vations (i.e. reads) in data set 5, := (Xp . . . ,x^) and 
n = observations in data set 5^ X^' := (x(, . . . ,x^'), 
where x^ and x^ are random variables with respective 
probability measures p and p\ and X^ and X^ are inde- 
pendently and identically distributed (i.i.d.) from p and p^, 
respectively. Can we decide at a given significance level to 
reject the null hypothesis p = p'? 

In order to decide this question, we will first define a 
proper test statistic that summarise the observations while 
at the same time retaining higher order information of the 
distributions. We will therefore employ Kernel methods, 
and use positive definite kernels to capture non-linearity 
of the original data through the higher-order moments. As 
with all kernel-based methods, the starting point for this 
approach is to define a feature map 0 (x) which maps the 
data into a high dimensional reproducing Kernel Hilbert 
Space (RKHS). While the dimension of the RKHS is usu- 
ally very high (or even infinite), all relevant quantities 
are determined in terms of inner products (in the RKHS) 
between feature vectors, and can be efficiently computed 
in terms of a finite number of evaluations of the kernel 
function 

/:(x,xO = (0(x),0(xO). 

In the RKHS, the mean element of a distribution p con- 
tains the information of all higher-order moments and we 
can compute the empirical estimates (/x^, jl^ ) of the mean 
elements for X^, X^ as 



m 

m ^ 



(1) 



i=l 



and /x^ respectively. Furthermore, we can use the distance 
between the mean elements of two distributions p^p' , 
{the maximum mean discrepancy, MMD) as test statistics. 
Intuitively, the greater the distance, the more different the 
two distributions are. For a given peak /, the dissimilar- 
ity between data set s and 5^ can therefore be expressed in 
terms of the MMD value: 



^ m ^ m,n 



i,)=l 



i,j=l 



(2) 



A modelling issue of central importance is the choice 
of the features and the kernel function k. In our case, 
we wish to preserve the spatial information contained 
in the peak profile. We therefore used the estimated 
mid positions of the mapped reads as observed features 
and the radial basis function (RBF) as kernel /:(x,xO = 



exp[—(x — x^)^/(2cr^)'\. The (hyper)-parameter a con- 
trols the length scale of the kernel, i.e. the distance (along 
the genome) at which fragment counts decorrelate. In our 
experiments, we used a heuristic suggested in [22] such 
that = 1/2 ' x^y where x is the median distance of all 
observations in X^ and X^ . 

Accounting for biological variability 

The bootstrap procedure for computing MMD statistics 
proposed in [21] has strong theoretical guarantees for 
discriminating between different distributions, given suf- 
ficient number of samples (i.e. reads mapped to a peak). 
A simulation study shows that the procedure appears to 
be well calibrated when comparing technical replicates of 
ChlP-Seq data (see Additional file 1). However, biological 
variability implies that the histogram distributions of the 
same peak in different biological repUcates will be more 
different than expected. This turns out to be true, and 
the testing procedure of [21] rejects the null hypothesis in 
almost all comparisons between biological replicates (see 
Additional file 1). 

In order to avoid this problem, we adopt a data-driven 
method to estimate biological variability from biological 
replicates. In general, this is a difficult task, as for most 
experiments only very few replicates are available; for 
example the ENCODE consortium set a standard of two 
independent biological replicates per ChlP-Seq measure- 
ment [25] . A reliable estimate of biological variability on 
a peak by peak basis is therefore rarely possible. To obvi- 
ate this problem, we pool peaks with similar total counts 
to generate robust estimates of /^-values (this information 
sharing is similar in spirit to the regression approach of 
DESeq, [9]). Specifically, for each peak / we determine 
the number hi of reads mapping to it averaged across all 
considered samples. Peaks are then binned into quantiles 
determined on the averaged counts per peak. To obtain 
empirical /7-values we compute the probability of observ- 
ing an MMD value between biological replicates in the 
given bin, which is at least as large as the one observed for 
a given peak in the comparison between conditions. Raw 
/^-values are subsequently corrected for multiple testing 
using the method of Benjamini and Hochberg [26]. 

Simulation study 

To benchmark the performance of our method in a quan- 
titative manner we initially resort to simulations. While 
simulations are necessarily limited in their biological real- 
ism, we think the availability of a ground truth is impor- 
tant for fair assessments, and the possibility of varying 
simulation parameters provides an excellent opportunity 
to explore the methods strengths and limitations. The 
strategy we follow to generate an artificial set of ChlP- 
Seq peaks is illustrated in Figure 2: we consider a control 
set of 10,000 simulated peaks. To assign a total count 
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Figure 2 Simulated ChlP-Seq experiment. A: MA-plots for simulated peaks; Each dot corresponds to a single peak. Black dots, green circles and 
purple crosses indicate unchanged sites, sites with changed profiles and sites with affinity changes, respectively. The left plot shows changes in base 
affinity in treatment vs control as a function of mean peak affinity, no biological variability and no sequencing effects are considered. In contrast, the 
right panel results if biological variance (Gamma distributed) and sampling of reads (Poisson distributed) are simulated. In this case, sites with 
unchanged base affinity may still show substantial fold changes, which hampers the detection of true differential sites. The filled green circle 
marked by an arrow corresponds to the profile depicted in detail in B: Simulated example profiles (mixtures of two Gaussian curves) with profile 
change simulated as a change in the mixing parameter. Left panels correspond to the control condition, right panels to the treatment condition. 
First row shows three peak profiles for each condition and the area under the curves integrates to 1 . Within each condition there is a small degree of 
variability regarding the position and width of the two sub-peaks and also their relative strength. Between conditions the mixing parameter 
changes substantially. In the middle row, each of the six profiles is weighted with the sample specific affinity value for the given peak. The areas 
under the curves now vary between samples. In the bottom row, the sequencing process is simulated with a Poisson distribution resulting in 
histograms of reads mapping along the extend of the peak. C: Receiver operator characteristic (ROC) curves for various methods. Left: only 
unchanged sites and sites with profile changes are considered; Right: only unchanged sites and sites with affinity changes are used. Circles indicate 
the considered operating point (FDR=0.05). 



to each peak, we follow the negative binomial (NB) gen- 
erative model, as suggested elsewhere [9,27]. This com- 
monly used hierarchical model effectively assumes that 
the between-sample variation follows a gamma distri- 
bution while the sequencing process leads to a Poisson 
distribution. We start by assigning a true base affinity 
value to each peak. These 'genomewide' affinity values are 
sampled according to the distribution of total counts in 
an ENCODE CTCF data set [28]. To simulate biological 
replicates, we generate sample-specific affinity values for 
each peak according to a Gamma distribution with mean 
value given by the true base affinity for that peak. The spa- 
tial structure of the peaks (peak profiles) is assumed to 
be bimodal, modelled as a mixture of two Gaussians with 
varying base means, variances and mixing parameters. 
The 'biological noise' in the peak profiles is modelled by 
sampling means, variances and mixing parameters from 
Gaussian distributions with means given by the true base 
values. To generate a 'treatment' set, we randomly chose 
100 peaks and introduce changes in their base affinity val- 
ues (Figure 2A). Likewise, we chose 100 peaks to change 
their base profile by varying the base mixing parameter 
(Figure 2B). We again create 'biological replicates' for the 
treatment condition. To obtain resulting affinity profiles' 
for a given peak, we have to multiply the local distribu- 
tions given by the peak profile with the peak's affinity 
value. To simulate the sequencing process, reads mapping 
to a peak are then sampled according to a Poisson dis- 
tribution. For simplicity, we assume the same library size 
for each sample and also that the overall enrichment of 
all peaks relative to the genomic background is identical 
in all samples. To assess the robustness of the methods' 
predictions, we repeated this procedure 10 times. 

As competitors for our method, we selected three 
count-based methods, DESeq [9], DBChip [8] and DIME 
[15]. Additionally, we investigate another shape-based 
method, where we suggest to replace the MMD distance in 
our method with the Generalised Mover Distance (GMD) 
which was recently suggested as a measure of the distance 



between histograms [29]. In Table 1, we report results 
on the affinity changes and on the profile changes sepa- 
rately. We summarize performances at false discovery rate 
(FDR) of 0.05, and also report the area under the Receiver 
Operating Characteristic (auROC) curve (Figure 2C). As 
expected, count-based methods cannot capture shape- 
based changes, with DESeq, DBChip and DIME all calling 
very few peaks essentially at random. On the contrary, 
MMDiff 's performance is overall very good, with a very 
low number of false positives. Interestingly, GMD is seen 
to perform globally as well as MMDiff, however its per- 
formance at the selected operating point (very high speci- 
ficity) is considerably worse. When we consider affinity 
changes all three count-based methods achieve very good 
results (particularly so for DIME and DBChip). MMDiff 's 
performance is considerably worse, while still significantly 
better than random; in particular, the number of false pos- 
itives called is very limited (c.f. GMD's high number of 
false positives). Therefore, MMDiff appears to be well cal- 
ibrated, with good power to capture profile changes and 
avoiding type I errors when dealing with count changes. 
In summary, MMDiff proves to be complementary to 
count-based methods, as expected. For a most exhausted 
analysis of differential regions that captures both types of 
changes we therefore suggest to combine MMDiff with a 
count-based method. 

Results and discussion 

Application 1: H3K4me3 data set 

We first used our method MMDiff to examine a ChlP- 
Seq data set investigating the epigenetic mark H3K4me3 
[24]. This study particularly focused on the question of 
how profiles of this mark are shaped by Cfpl, which is 
known to be a conserved DNA-binding subunit of the 
H3K4 histone methyltransferase (HMT) Setl complex. 
The experiment presented consists of ChlP-Seq measure- 
ments from three different cell lines: (1) a wild- type mouse 
ES cell line (WT), (2) a mutant ES line lacking Cfpl (Cfjpl-/-) 
[30,31], and (3) a rescue cell line obtained by stable 
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Table 1 Differential peak calling on simulated data 



Profile changes 





TP 


FP 


FN 


TN 


eFDR (%) 


SN (%) 


SP (%) 


auROC (%) 


DESeq 


0 + 0.0 


0.7 + 0.8 


100 + 0.0 


9799.3 + 0.8 


NaN 


0 + 0.0 


100 + 0 


50 + 0 


DBChIP 


0.1 +0.3 


2.4+1.7 


99.9 + 0.3 


9797.6 + 1 .7 


NaN 


0.1 +0.3 


100 + 0 


50 + 0 


DIME 


0 + 0.0 


1.4+1.0 


100 + 0.0 


9798.6 + 1 .0 


NaN 


0 + 0 


100 + 0 


50 + 0 


GMD 


17.8 + 9.1 


5.5 + 2.9 


82.2 + 9.1 


9794.5 + 2.9 


26+10 


17.8 + 9.1 


99.9 + 0 


83 + 0 


MMDiff 


34.6 + 4.1 


0.7 + 0.8 


65.4 + 4.1 


9799.3 + 0.8 


2 + 0 


34.6 + 4.1 


100 + 0 


83 + 0 



Affinity changes 





TP 


FP 


FN 


TN 


eFDR (%) 


SN (%) 


SP (%) 


auROC (%) 


DESeq 


27.0 + 5.6 


0.7 + 0.8 


73.0 + 5.6 


9799.3 + 0.8 


2 + 0 


27.0 + 5.6 


100 + 0 


81+0 


DBChIP 


50.1 + 3.8 


2.4+1.7 


49.9 + 3.8 


9797.6+1.7 


4 + 0 


50.1 +3.8 


100 + 0 


94 + 0 


DIME 


45.3 + 4.1 


1 .4 + 1 .0 


54.7 + 4.1 


9798.6 + 1 .0 


3 + 0 


45.3 + 4.1 


100 + 0 


95 + 0 


GMD 


2.1 +2.1 


5.5 + 2.9 


97.9 + 2.1 


9794.5 + 2.9 


73 + 30 


2.1 +2.1 


99.9 + 0 


60+10 


MMDiff 


2.5 + 1.5 


0.7 + 0.8 


97.5 + 1.5 


9799.3 + 0.8 


NaN 


2.5 + 1 .5 


100 + 0 


70 + 0 



Performance summary of five different methods on ten runs of simulated data sets. In the upper panel unchanged sites and sites with profile changes are considered, 
in the lower panel unchanged sites and sites with affinity changes. FDR threshold: 0.05; TP: true positives, FP: false positives, FN: false negatives, TN: true negatives, 
eFDR: empirical FDR {FP/(FP+TP)), SN: sensitivity, SP: specificity, auROC: area under ROC. 



transfection of a human Cfpl cDNA into Cfpl-/- ES cells 
(Resc) [32,33]. We expected that H3K4me3 is reduced in 
the Cfpl-/- cells. However, as the H3K4 specific HMT 
activity is redundantly encoded in at least six different 
complexes in mammals, the precise target regions of Cfpl 
were unknown [34]. In addition, under the assumption 
that the different enzymes potentially act cooperatively 
at the same target regions, we expected that this his- 
tone modification would not be completely abolished at 
these regions but rather reduced, potentially leading to 
altered peak profiles. In [24], it was confirmed that Cfpl 
is expressed at near endogenous levels in the rescue cell 
line and that the H3K4me3 levels are comparable to the 
levels observed in WT. To detect changes that are pri- 
marily due to the absence of Cfpl, we will thus contrast 
the variability between WT and Resc with the observed 
changes between WT and Cfpl-/-. Effectively using the 
Resc sample as a biological replicate for the control group 
will lead to a potential over-estimation of biological vari- 
ation resulting in a conservative estimate of differential 
H3K4me3 patterns. 

Clouaire et al. repeated the complete experiment on 
biological replicates [24]. The antibodies used (here 
abbreviated with AB.l and AB.2) have slightly differ- 
ent specificities, plausibly resulting in different signal to 
noise ratios and the two experiments were therefore anal- 
ysed independently as two repeat experiments. We report 
results obtained by MMDiff and compare them with 
results obtained using DESeq as it is the most widely used 
count-based methods. 



Peak finding 

We started our analysis by identifying genomic regions 
that are significantly enriched for H3K4me3 modifica- 
tions. We used the software package MACS on each 
of the data sets [11] and subsequently created a set of 
67,035 MACS consensus peaks from regions overlapping 
in at least three data sets. We found that only 24% of 
these peaks overlapped with 4kb windows around TSSs. 
However, around 70% of reads mapping to peaks in WT 
were found in these promoter proximal peaks. This is in 
good agreement with the fact that H3K4me3 is known 
to localise around transcription start sites [19]. We con- 
clude that in addition to the promoter proximal regions, 
MACS calls a large number of narrower, low coverage 
peaks, which are potentially more likely to be spurious. 
We therefore complement our analysis by investigating 
27,807 promoter regions defined by known annotated 
genes. Note that in WT about half of these promoters 
show only small enrichment for H3K4me3. 

To ensure comparability of the data sets, we corrected 
for different sampling depths using the normalisation 
method suggested in [9]. For simplicity, we will refer to the 
normalised number of reads mapping to peak / in sample 
s as the total counts, nj. The full pre-processing pipeline is 
described in detail in the Additional file 1 which also con- 
tains further initial analysis demonstrating that the data 
sets are only weakly affected by input biases and other 
biases such as GC content [35]. 

Resulting ChlP-Seq signals at three promoter regions 
are shown in Figure 1. The shapes of the peaks are 
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remarkably well conserved between WT and Resc and 
also between the two experiments, confirming our moti- 
vation to exploit shape conservation between replicates 
to increase the sensitivity of differential tests. In general, 
we see a signal decrease in the Cfpl-/- cells as com- 
pared to WT/Resc, as expected. However, these changes 
often appear to be highly spatially dependent: for exam- 
ple the profiles in Figure lA and C are only affected 
downstream of the promoter. Interestingly, the profiles 
in Figure IB show similar total counts in WT/Resc and 
Cfpl-/- ES cells, as the massive decrease in the region 
downstream of the promoter is partially compensated for 
by an increase in the upstream part of the peak. This high- 
lights the importance of considering shape based features 
when testing for statistically significant differences as all 
three promoter regions are consistently called by MMDiff 
in both experiments, but none is called by DESeq in any 
of the experiments. 



Differential peak calling 

We used MMDiff to find peaks and promoter regions 
that are significantly different in the Cfpl-/- cell line ver- 
sus WT and Resc. To elucidate the working principles of 
MMDiff, we show in Figure 3 MMD values versus mean 
total counts for the 27,807 promoter regions. In Figure 3A, 
MMD values between Cfpl-/- and WT are shown. For 
comparison, MMD distances between Resc and WT are 
overlayed in Figure 3B. As expected from equation 2, the 
MMD value between replicates strongly depends on the 
coverage of the peak, with high enriched peaks showing 
smaller MMD values. In contrast, there is a large number 
of promoters with high coverage that have been assigned a 
large MMD value in the Cfpl-/- vs WT comparison. This 
leads to a clear separation of a group of differentially mod- 
ified promoters (DMPs) with enrichment profiles that are 
more different between Cfpl-/- and WT/Resc than can 
be explained by experimental and biological variation. In 
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Figure 3 Differential calling and reproducibility in H3K4me3 ChlP-Seq data sets. A-C MMD-based distances as a function of mean total counts 
in experiment AB.1 . Each dot represents one examined promoter. A MMD values computed between Cfpl -/- and WT. B MMD determined between 
Resc and WT overlayed in black. These provide a measure of the biological and experimental variability. C Plots are overlayed and promoters that are 
significantly different in Cfpl-/- versus W/Resc {FDR < 0.05) are shown in red. D-E MA plot representations of the same data showing smooth 
scatter plots of log2 fold changes versus mean normalised counts. The red dots mark promoters detected as differentially modified (DMPs) at a 5% 
false discovery rate. D DMPs according to MMDiff and E according to DESeq. F Reproducibility of differential calling across experiments AB.1 and 
AB.2. DESeq and MMDiff are compared both for differentially called promoters (left) and for MACS consensus peaks. 
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Figure 3C 2022 promoters with a FDR < 0.05 are marked 
in red. 

The fact that most DMPs appear to have large mean 
total counts may partly be due to the fact that most 
changes appear at promoters that are strongly enriched 
in H3K4me3 under normal conditions [24], and partly 
because the peaks with low total counts are more dom- 
inated by noise and do not exhibit a conserved shape 
between WT and Resc. While total counts are not used as 
a discriminating feature by MMDiff, it is also interesting to 
see that most DMPs exhibit a change in total counts as is 
elucidated in an MA-plot (Figure 3D), where fold change 
is plotted versus mean total counts. As expected, the great 
majority of DMPs lose H3K4me3 as a consequence of 
Cfpl depletion. 

DESeq calls 1491 promoters to be significantly different 
between Cfpl-/- and WT/Resc. Interestingly, the overlap 
between DMPs called by MMDiff and DESeq is small; only 
584 promoters are called by both methods and the dif- 
ference between these methods becomes apparent when 
comparing the respective MA- plots: To call a region dif- 
ferential, DESeq requires a large fold change even for 
promoters with large coverage (Figure 3E). On the con- 
trary MMDiff is confident in calling regions differential 
based on different shapes even when the fold change is 
small, provided that shapes are conserved between repU- 
cates. Examples for those promoters are given in Figure 1, 
which have all been called by MMDiff but not DESeq. On 
the other hand, DESeq calls a number of DMPs which 
have relative low coverage (between 50 and 1000 counts 
in a 4kb window) but relatively high fold change. These 
promoters are practically bare of H3K4me3 in WT and 
Resc, however they appear to gain a small amount of 
H3K4me3 upon Cfpl depletion as can be seen in the 
example in Figure 4A. Overall, this analysis demonstrates 
that MMDiff has a high sensitivity to detect differen- 
tial modified promoters when a reproducible profile is 
observed between WT and Resc. The low overlap between 
peaks called by MMDiff and DESeq further illustrates on 
a real data set the complementary nature of MMDiff to 
count-based methods. 

Reproducibility 

In the absence of a ground truth it is particularly difficult 
to evaluate and compare the results obtained from dif- 
ferent methods. To approach an answer to the question 
whether the called DMPs are genuine or false positives, 
we are particularly interested in two aspects: the repro- 
ducibility of the differentially called regions, and their 
biological significance. To test the first aspect, we run 
independent analyses on the data sets obtained with the 
two different antibodies, and report the overlap of peaks 
called between the two experiments. Figure 3F shows bar 
charts of promoters and MACS consensus peaks called 



by DESeq and MMDiff in the two experiments; MMD- 
iff is seen to call more promoters than DESeq, and also 
to have a larger fraction of promoters called consistently 
in both experiments. In the case of MACS consensus 
peaks, the numbers of regions called consistently in both 
experiments appear to be relatively low for both meth- 
ods (15% for DESeq and 26% for MMDiff). However, 
MMDiff again is more consistent across experiments than 
DESeq. 

This analysis demonstrates that the outcome of dif- 
ferential peak calling can vary when experimental data 
sets obtained with different antibodies are considered. 
Also, uncertainties introduced in the peak calling step 
can propagate to the differential peak calling procedure. 
To increase both, sensitivity and specificity, it is highly 
advisable to increase the number of considered repUcates. 
The analysis also shows that employing shape features as 
done with MMDiff can lead to improved robustness of the 
results. 

Changes of Pol II occupancy at Cfpl target genes 

In order to assess the biological significance of the 
observed changes, we analysed a Pol II ChlP-Seq data 
set from the same Cfpl study [24]. We now restrict 
our analysis to the promoter regions, in order to avoid 
the ambiguous assignment of peaks to genes. Using the 
pipeline described above, we investigated whether there 
are changes in Pol II binding - and thus gene transcription - 
associated with the called H3K4me3 DMPs. 

As previously reported, changes in Pol II binding fol- 
lowing Cfpl deletion appear to be modest [24] and only 
very few promoters are detected to be differentially bound 
by Pol II (9 and 24 called by DESeq and MMDiff, respec- 
tively). This is surprising given the widely accepted role of 
H3K4me3 as epigenetic mark at active promoters. A pos- 
sible explanation is that at most promoters residual levels 
of H3K4me3 remain and these basal levels may be suffi- 
cient to partially retain Pol II binding, so that changes are 
difficult to detect (see Figure 4B). A remarkable excep- 
tion is shown in Figure 4C where the H3K4me3 signal is 
completely lost at the promoter of Jade-1 which is accom- 
panied with the complete removal of Pol II binding. We 
next investigated whether there was a small but system- 
atic shift of Pol II binding associated with other H3K4me3 
DMPs. Figure 4D and E show MA-plots for the Pol II data 
set, with DMPs determined on the H3K4me3 set shown in 
red, and Figure 4F shows the distribution of fold changes 
in Pol II binding between Cfpl-/- and WT/Resc. We see 
a clear down-regulation of genes associated with DMPs 
called by MMDiff {p < 10~^^, Wilcoxon rank sum test^). 
In the case of DMPs called by DESeq, the distribution also 
has a mean significantly different from zero, but appears 
highly non-Gaussian. This is consistent with the finding 
that DESeq calls a number of small 'ectopic' promoters 
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Figure 4 Changes of H3K4me3 levels are correlated with changes in Pol II binding. A-C Example DMPs at three annotated genes, showing 
H3K4me3 patterns and Pol II binding profiles. Input is shown as dashed, black lines. A Promoter called by DESeq but not MMDiff showing an 
increased H3K4me3 peak in the Cfpl -/- sample. B Promoter called by MMDiff but not DESeq with substantial decrease in H3K4me3 and modest 
change in Pol II binding. C Promoter of Jade-1 showing complete loss of H3K4me3 accompanied with elimination of Pol II binding (called by both). 
D, E MA-plots of Pol II binding. Promoters with significant differential H3K4me3 patterns are marked with red dots: D DMPs according to DESeq and 
E DMPs according to MMDiff. F Distribution of observed fold changes in Pol II binding (Cfpl-/- versus W/Resc). black: all promoters, red: DMPs 
detected by MMDiff (Wilcoxon rank sum test, p-value < 10~^^). blue: DMPs detected by DESeq: p-value < 10~^^. 



which are bare of H3K4me3 in WT but gain H3K4me3 in 
the absence of Cfpl which is accompanied with very low 
levels of Pol II binding in Cfpl-/- cells (see Figure 4A). 
This analysis demonstrates that differences in H3K4me3 
detected by MMDiff correlate well and consistently with 
subtle changes of Pol II binding, lending further evidence 
to the high quality of MMDiff results. It also shows that 
the relationship between H3K4me3 modifications and Pol 
II binding is more complex than expected, showing highly 
non-linear behaviour. 

Functional annotation of Cfpl target genes 

We have observed that Cfpl substantially affects the 
H3K4me3 levels at a large number of promoters and we 
next asked whether it specifically targets genes which 
share particular functional pathways. We performed an 
enrichment analysis for gene ontology (GO) terms using 
the Ontologizer package [36]. As a study set we used a 
set of 759 genes associated with differential promoters 
detected by MMDiff in both experiments (AB.l and AB.2) 
and which showed a decrease in H3K4me3 upon deple- 
tion of Cfpl-/- and similarly for DESeq (322 genes). These 
two sets were contrasted with a population set consist- 
ing of 11,459 genes that showed substantial enrichment 
for H3K4me3 in WT and Resc in both experiments. Inter- 
estingly, despite the small overlap between the MMDiff 
set and the DESeq set (only 18% of the combined set 
are shared), 9 out of the 10 most enriched GO terms 
are consistent between the two sets: These GO terms 
include 'RNA processing', 'RNA binding', ribonucleopro- 
tein complex biogenesis', structural constituents of ribo- 
somes' and ribonucleoprotein complex', which were all 
highly enriched in the downregulated DMP sets (adjusted 
p-values < 10~^). In the MMDiff set, genes annotated 
with 'translation' are also highly overrepresented. These 
findings are in very good agreement with the phenotype 
of Cfpl depletion in ES cells, where global protein syn- 
thesis is strongly affected by a reduced abundance of free 
ribosomes [37]. To avoid detection biases, we illustrate 
the clustering of functionally related genes graphically by 
annotating genes in the H3K4me3 MA-plot (Figure 5A). 
We find that indeed the majority of promoter regions 
of 928 genes associated with RNA binding and process- 
ing, translation and structural constituents of ribosomes 
are clustering together showing a substantial decrease 
of H3K4me3 levels. The most drastic changes can be 



observed in genes which are structural constituents of 
ribosomes. This trend is also observable in the Pol II 
MA-plot (Figure 5B). In this case, individual fold changes 
are much smaller, as discussed above, however, a large 
number of ribosomal RNAs or proteins are affected. The 
cooperative impact of a large number of small effects on 
genes involved in the same functional mechanisms may 
well explain the phenotype of reduced protein synthesis 
in Cfpl-/- ES cell lines [37]. We conclude that changes in 
the H3K4me3 level detected by MMDiff are likely to play 
functionally important biological roles. 

Co-occurring transcription factor binding sites 

We next examined the sequence composition of promot- 
ers with H3K4me3 profile changes in order to improve 
our understanding of the Cfpl binding mechanisms. We 
used the MEME suite to find overrepresented sequence 
motifs in putative Cfpl target promoters as detected by 
MMDiff [38]. Again, we used the subset of 11,458 pro- 
moters with significant H3K4me3 enrichment to create a 
background model (Markov model of order 6). Among the 
top ten discovered motifs we found four binding motifs 
of the activating E2F family transcription factors, E2F2 
and E2F3, with p-values < 10~^^ (see Figure 5C). This 
finding is in good agreement with recent data suggest- 
ing that the HMTs MLL2 and Setl directly associate with 
E2F transcription factors [39,40] and indirect DNA bind- 
ing of Cfpl via E2F TFs might be the explanation to why 
a DNA binding deficient Cfpl mutant has been shown 
to be able to rescue reduced levels of H3K4me3 at most 
affected promoter regions [24] . We conclude that MMDiff 
is a powerful tool to promote the identification of tran- 
scription factor motifs and potential co-factors which play 
important roles in targeting HMTs to gene promoters. 

Cluster analysis of peaks 

Next, we set out to identify common patterns in H3K4me3 
profiles and asked whether promoters with similar profiles 
were also affected in a comparable way by Cfpl deple- 
tion. This approach is motivated by the idea that different 
clusters encoding different shapes might reflect differ- 
ent binding mechanisms or different control functions. In 
addition, we asked if Cfpl depletion had a homogeneous 
effect on all TSS sites, or if differences might depend on 
the shape observed in WT itself. Similar to [41], we per- 
formed a cluster analysis on the peak histograms derived 
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from the WT sample, using a Gaussian Mixture Mociel 
(GMM) with covariances constrained to be diagonal in 
order not to overfit^. We ran GMM multiple times for 
different cluster numbers and used the Bayesian Informa- 
tion Criterion (BIG) to determine the appropriate number 
of clusters. We observed a minimum of BIG at /c = 18 
clusters, which proved to be robust against different ini- 
tialisations of the algorithm, and the same minimum was 
found both in the WT and Resc data sets. 

Figure 6A presents a heat map visualisation of the clus- 
tering results. Average H3K4me3 and Pol II profiles for 
three clusters are shown in Figure 6B and G. Remarkably, 
genes within the same H3K4me3 cluster also appear to 
have distinctive Pol II profiles and wider H3K4me3 peaks 
are reflected in broader binding of Pol II. 

We further investigated the relationship between dif- 
ferential histone modification and shape clustering by 
analysing how the detected DMPs are distributed over the 
clusters, see Table 2. We find that clusters 12, 14-16, and 
18 are highly significantly (p<0.001) enriched for DMPs. 
In contrast, we detected far fewer differential peaks than 



expected under the null hypothesis in clusters 1-6 and 
8. To assess the significance of this clustering, we report 
the mean fold change in Pol II counts for genes within 
each cluster. We see that clusters which are enriched 
for differential H3K4me3 patterns systematically have a 
decrease in Pol II, while clusters which are unaffected by 
the Gfpl deletion seem to have rather stable Pol II levels. 
Again, it can be observed that H3K4me3 profile shapes are 
highly informative and differences in these shapes likely 
encode different mechanisms for the establishment of this 
important epigenomic marker. 

Application 2: H3K27ac 

To further explore the broader applicability of MMD- 
iff, we applied it to a H3K27ac ENGODE data set. 
This epigenomic mark is known to localize around 
enhancer elements and distinguishes active enhancers 
from poised ones [42]. Here we compare two human 
cell lines, K562, an immortalised myelogenous leukemia 
line, and GM12878, a lymphoblastoid cell line. We use 
two replicates per cell line and analyzed 69,577 regions 
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derived from the respective ENCODE broadPeak files [43] 
after merging overlapping peaks [7]. Using DESeq, 25% 
(18,080) of all peaks appear to be differential between the 
two cell lines. With MMDiff we only detect 5631 changes, 
of which 1827 are unique to MMDiff. Figure 7 A shows 
a typical example region which was detected by DESeq 
but not MMDiff. It is apparent that, despite a large fold 
change, the shapes of the peaks are very similar in the 
two cell lines. In contrast Figure 7B and 7C show exam- 
ple regions detected by MMDiff and not DESeq. In this 
case the number of reads mapping to the whole region 
is very similar in the two cell lines. However, there are 



sharp, well localized peaks in the K562 cell line, while 
broad regions of low enrichment in the GM12878 cell line. 
In summary, large fold changes seem to be prevalent in 
this comparison, however some profile changes are also 
present which can be picked up by MMDiff. 

Application 3: CTCF binding 

Finally, we tested MMDiff on a ChlP-Seq data set measur- 
ing the genome-wide binding of the transcription factor 
CTCF. CTCF is a transcriptional repressor which also 
plays a fundamental role in regulating the 3-D structure 
of chromatin [44]. As such, it has been widely studied in 
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recent years, with several ChlP-Seq experiments identify- 
ing thousands of binding sites across the genome. 

Here we used an ENCODE CTCF ChlP-Seq data set 
consisting of two repUcates from three mouse tissues; cor- 
tex, cerebellum and liver [28]. The choice of tissues was 
deliberately heterogeneous to check the ability of MMDiff 
to identify both subtle changes (as expected between cor- 
tex and cerebellum) and more marked changes between 
brain tissues and liver. We used the provided broadPeaks 
files and after merging overlapping peaks we obtained 
49,762 sites for further analysis. Once again, we compared 
the results of MMDiff and DESeq across pairwise compar- 
isons between tissue types: cortex vs liver (CL) and cortex 
vs cerebellum (CC). Using a threshold ofp < 0.05 for dif- 
ferential peak calling, MMDiff identified 2145 differential 
peaks in CL and 442 in CC, with DESeq identifying 2052 
in CL and 46 in CC respectively. The overlap between 
peaks called by the two methods is limited, with 606 
peaks called by both in CL and only 15 in CC, further 
demonstrating the complementarity of the two methods. 
As expected, fewer differences were called by both meth- 
ods in CC as opposed to CL; Figure 7D shows an example 
of a peak detected by DESeq in both CL and CC and 
not called by MMDiff. Again, the peak has a very similar 
profile in all samples while the total counts vary greatly 
between tissues. In contrast, two example peaks called by 
MMDiff and not by DESeq are shown in Figure 7E and 
F. In Figure 7E, CTCF seems to be bound at two dis- 
tinct binding sites in the cortex and liver. However, in the 
cerebellum, one of these sites appears to be vacant. It is 
noteworthy, that this change might have been detected 
by count-based methods if more stringent regions around 
the two binding sites had been considered. These meth- 
ods are therefore more depending on the peak calling and 
peak merging processes*^. Figure 7F shows two binding 
sites which are less than 200bp apart. In this case, CTCF 
is bound at both sites in cerebellum and liver and occupies 



only one binding site in the cortex sample. As illustrated 
by this example, MMDiff is capable of directly detecting 
changes at homotypic binding events at neighbouring 
binding sites. 

The R package MMDiff 

These applications show that our method is generic 
enough to be used in the analysis of a wide range 
of ChlP-Seq data sets which capture other epigenomic 
marks or (broad) binding patterns of DNA-associated 
proteins. It is now available as a Bioconductor R pack- 
age (package MMDiff), with complete documentation 
and examples. Additional updates are also available 
from the project webpage [http://homepages.inf.ed.ac.uk/ 
gschweik/MMDiff.html] . 

Conclusions 

ChlP-Seq is one of the most widely employed experi- 
mental techniques in functional genomic and epigenomic 
studies, yet statistical analysis of ChlP-Seq data still poses 
many challenges. In this paper, we address the prob- 
lem of statistical testing in ChlP-Seq data sets, and pro- 
pose a non-parametric methodology which is capable of 
accounting for the highly structured nature of this type 
of data. Compared with techniques based on total counts, 
MMDiff can identify localised changes which alter the 
shape of a peak. The identification of such changes is 
particularly relevant in the light of recent findings that 
suggest a functional significance of the shape of histone 
modifications. For example, an analysis of H3K27me3 pat- 
terns around CTCF peaks, carried out as part of the 
ENCODE project, reported that the observed asymmet- 
ric shapes of H3K27me3 support the role of CTCF sites 
in delimiting active and polycomb-silenced domains [1]. 
Furthermore, chromatin signatures have recently been 
associated with other biologically relevant features such as 
first exon length [20]. MMDiff 's ability to capture shape 
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Figure 7 Differential peak calling in H3K27ac and CTCF data sets. A-C: Three example H3K27ac peaks called as differential when comparing 
human K562 and GMl 2878 cell lines (data from ENCODE consortium). D-F Three example CTCF peaks in samples derived from mouse cortex, 
cerebellum and liver. Peaks are called differential in the cortex vs cerebellum comparison. Black and red bars demark CTCF motifs on the forward 
and reverse strand, respectively. Peaks shown in A, D) are called by DESeq only; peaks in B, C, E, F are called by MMDiff only. 



changes in peaks may therefore enable the analysts to cap- 
ture functionally significant changes in patterns of histone 
modifications or transcription factor binding which would 
not be retained by methods which only use total counts for 
testing. From the practical point of view, focusing on peak 
shape largely circumvents problems arising from choosing 
the right normalisation, and MMDiff is also independent 
of the definition of a suitable noise model 

Methodologically, MMDiff belongs to the family of 
Kernel based methods; these have a long history in bioin- 
formatics, and have had a considerable influence in the 



analysis of high throughput sequencing data. An approach 
which is related to ours has been recently proposed for 
the purpose of alternative isoform detection from RNA- 
Seq data [23]. While the methodology proposed in that 
paper also relies on MMD, the application domain is 
significantly different, as is the treatment of biological 
noise. 

In the context of ChlP-Seq data, our empirical results, 
both on simulations and on three independent data 
sets, demonstrate that our approach is complementary 
to count-based methods such as DESeq. A practically 
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advisable strategy may be to couple the two methods 
within an analysis pipeline, allowing analysts to detect 
both peaks that change in shape and peaks that only 
exhibit changes in total counts of reads, while maintaining 
the overall shape of the peak. As for all statistical testing 
methods, it is worthwhile to emphasize that multiple bio- 
logical replicates are necessary to get a reliable estimate of 
the biological variance. 

To strengthen our claim that our approach can provide 
a different perspective in the analysis of ChlP-Seq data, 
and can be an effective tool for hypothesis generation, we 
have carried out an in-depth analysis of results of using 
MMDiff on the data presented in [24]. We demonstrated 
that MMDiff reproducibly yields biologically meaningful 
results. We were able to suggest mechanisms that link 
molecular observations of altered H3K4me3 patterns to 
phenotypes observed in Cfpl-/- ES cells [37]. In particular, 
we find that a large number of genes playing a functional 
role in protein synthesis are potentially targeted by Cfpl. 
Effects on Pol II binding - and thus potentially transcrip- 
tion - at each individual affected gene seem to be very 
small; however, taking all affected genes together, we find 
a significant decrease of Pol II binding at these genes 
which is in agreement with the observation that Cfpl-/- 
ES cells show a reduction in translation initiation. Further- 
more, the mild effect of Cfpl deletion on Pol II binding 
at most promoters is in strong contrast to the observation 
at the promoter of Jade-1. Here, the lack of H3K4me3 in 
the Cfpl depleted cell leads to a complete abolishment of 
Pol II binding. In this specific case, H3K4me3 seems to 
act as a switch directly regulating primary transcriptional 
mechanisms. Jade-1 is of particular interest as it is a key 
player in H4 acetylation at active genes [45] . It was earlier 
shown that in the presence of the human tumour suppres- 
sor proteins ING4 and ING5, Jade-1 targets the chromatin 
through interaction with H3K4me3 modifications [46]. 
Our finding may therefore hint to an epigenomic feed- 
forward loop based on cross-talk between H4 acetylation 
and H3K4 methylation. 

Our results demonstrate the potential of non- 
parametric kernel methods to lead to novel biological 
insights from the analysis of ChlP-Seq data. It is an inter- 
esting direction for further research to investigate how 
the structured nature of NGS data can be exploited in 
predictive models for more general tasks than statistical 
testing. 

Endnotes 

^ alternative hypothesis "true location ^ to 0". 

^ About half of the TSSs were discarded prior to the 
analysis due to the absence of H3K4me3 enrichment. 
Additionally, regions overlapping with more than one 
TSS were excluded resulting in a set of 4148 promoter 
regions. 



^ Also note, that the high spatial resolution of the peaks 
is achieved by showing histograms of the corrected 
midpoints of the reads as opposed to coverage plots. 
Corresponding UCSC Genome Browser views are shown 
in the Additional file 1. 

Additional file 



Additional file 1 : Supplementary information. Contains all 
supplementary notes and supplementary figures. 
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